SSHAARP: Searching Shared HLA Amino Acid Residue Prevalence

Livia Tran - livtran@chori.org

July 12, 2019

Overview

The Searching Shared HLA Amino Acid Residue Prevalence (SSHAARP) Package processes amino acid alignments produced by the IPD-IMGT/HLA Database to identify user-defined amino acid residue motifs shared across HLA alleles, and calculates the frequencies of those motifs based on HLA allele frequency data. SSHAARP uses Generic Mapping Tools (GMT) software and the GMT R Package to generate global frequency heat maps that illustrate the distribution of each user-defined map around the globe. SSHAARP analyzes the allele frequency data described in Solberg et al. 2008, a global set of 497 population samples from 185 published datasets, representing 66,800 individuals.

Future plans for SSHAARP include producing heat maps based on allelic frequencies obtained from the Allele Frequency Net Database (AFND).

Latest amino acid alignments can be found at https://github.com/ANHIG/IMGTHLA/tree/Latest/alignments.

The Solberg (Solberg et al, “Balancing selection and heterogeneity across the classical human leukocyte antigen loci: A meta-analytic review of 497 population studies”, Hum Immunol (2008), doi: 10.1016/j.humimm.2008.05.001) dataset can be found at http://pypop.org/popdata/ under results.zip.

SSHAARP has three main families of functions that are based on which pieces of information/software they interact with:

  1. IPD-IMGT/HLA:

    countSpaces counts the number of whitespaces in the ANHIG alignment files to discern where the alignment start is.

    AA_segments_maker extracts alignment sequence information for a given locus from the ANHIG/IMGTHLA Github Repository to produce a dataframe, which contains individual amino acid data for each amino acid position for all alleles, for a user-defined HLA locus or loci.

    motif_finder isolates the individual amino acid position dataframe produced by AA_segments_maker to only alleles with the user-defined motif.

  2. Solberg Dataset:

    coordinate_converter converts coordinates in the Solberg dataset with cardinal directions to their appropriate enumerations (i.e. 50S is converted to -50)

    dataSubset manipulates the Solberg dataset by adding in a new column containing locus*allele information, reordering the dataset based on population name, and subsetting the data to only the locus of interest.

  3. Generic Mapping Tools (GMT)

    PALM produces a heatmap from the further manipulated Solberg dataset by using the Generic Mapping Tools (GMT) R Package, which is an interface between R and the GMT Map-Making software. GMT commands are called via a Bash script through using the gmt.system() function in the GMT R package. Users must have the GMT R package AND GMT software installed, which is downloadable here: https://www.soest.hawaii.edu/gmt/.


Functions

AA_segments_maker

AA_segments_maker extracts alignment sequence information for a given locus from the ANHIG/IMGTHLA Github Repository to produce a dataframe, which contains individual amino acid data for each amino acid position for all alleles, for a user-defined HLA locus or loci. The first 4 columns are locus, allele, trimmed allele, and allele_name.

Example output of AA_segments_maker for HLA-A, where the selected row outputs are 2, 549, and 800 to illustrate different HLA-A alleles, and selected columns are the first 20 columns of the dataframe, where the first 4 columns consist of the locus, allele, trimmed_allele of 2 fields, and the full allele name. The rest of the columns consist of corresponding amino acids for each amino acid position for a given allele.

>AA_segments_maker("A")[[1]][c(2,8),1:20]
    locus      allele trimmed_allele   allele_name 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
2       A 01:01:01:01        A*01:01 A*01:01:01:01 M A V M A P R T L  L  L  L  L  S  G  A
549     A    02:01:30        A*02:01    A*02:01:30 M A V M A P R T L  V  L  L  L  S  G  A
800     A    02:30:01        A*02:30    A*02:30:01 M A V M A P R T L  V  L  L  L  S  G  A

The first row of AA_segments_maker is a correspondence table of amino acid position enumerations. Alignments begin with a negative enumeration for the the amino acid positions within the first exon, which encodes the leader peptide, since the leader peptide is eventually cleaved off. The rest of the amino acid positions associated with exons other than exon 1 have positive enumerations. In addition, the alignment counts do not give amino acid positions with inDels a count. The correspondence table was created to begin enumeration at position 1, since it is more standard, and to account for the skipped over positions with inDels. Below is an example of what the correspondence table looks like – the alignment sequence for the first amino acid postion, -24, corresponds to a more standard beginning enumeration of 1.

>AA_segments_maker("A")[[1]][1,1:20]
  locus allele trimmed_allele allele_name   1   2   3   4   5   6   7   8   9  10  11  12
1  <NA>   <NA>           <NA>        <NA> -24 -23 -22 -21 -20 -19 -18 -17 -16 -15 -14 -13
   13  14  15 16
1 -12 -11 -10 -9
###

motif_finder

motif_finder splits up the input motif (should be in format LocusAA_1AA_2AA_3, where the amino acid positions are in numerical order). If the motif is not in numerical order, such as DRB130Y28E26F, the function will rearrange the motif such that the 26F is searched for first. The following output is an example of the 1st, 5th, and 9th DRB1 alleles that have the DRB1*26F28E30Y motif. Recall the dataframe is formatted with actual sequence enumeration (starting with 1), but the search is done by matching the input motif, which is based off of alignment enumeration, with the correspondence table. Thus, in the output dataframe, alignment position 26 is equivalent to actual position 55, 28 to 58, and 30 to 60.

>motif_finder("DRB1*26F~28E~30Y")[c(1,5,9,10), 1:65]

    locus   allele trimmed_allele   allele_name   1   2   3   4   5   6   7   8   9  10  11
1    <NA>     <NA>           <NA>          <NA> -29 -28 -27 -26 -25 -24 -23 -22 -21 -20 -19
197  DRB1 03:02:04     DRB1*03:02 DRB1*03:02:04   *   *   *   *   *   *   *   *   *   *   *
240  DRB1    03:38     DRB1*03:38    DRB1*03:38   *   *   *   *   *   *   *   *   *   *   *
256  DRB1    03:53     DRB1*03:53    DRB1*03:53   *   *   *   *   *   *   *   *   *   *   *
     12  13  14  15  16  17  18  19  20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
1   -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1  1  2  3  4  5  6  7  8
197   *   *   *   *   *   *   *   *   *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  R  F  L
240   *   *   *   *   *   *   *   *   *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  R  F  L
256   *   *   *   *   *   *   *   *   *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  R  F  L
    38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55      56 57 58 59 60 61
1    9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 InDel_1 27 28 29 30 31
197  E  Y  S  T  S  E  C  H  F  F  N  G  T  E  R  V  R  F       .  L  E  R  Y  F
240  E  Y  S  T  S  E  C  H  F  F  N  G  T  E  R  V  R  F       .  L  E  R  Y  F
256  E  Y  S  T  S  E  C  H  F  F  N  G  T  E  R  V  R  F       .  L  E  R  Y  F

If a motif is not found, an error message is output:

>motif_finder("DRB1*26F~28E~30Z")

Warning message:
In motif_finder("DRB1*26F~28E~30Z") :
  Error - zero alleles match this motif. Please try again.

dataSubset

dataSubset manipulates the Solberg dataset to prepare for heatmap generation by adding in a new column containing locus*allele information, reordering the dataset based on population name, and subsetting the dataset to only the locus of interest.

Partial depiction of Solberg Dataset before dataSubset - note population names are out of order, and the data consists of data from multiple HLA loci

>dataset[c(1,2,300,600, 1000),]

     dataset     popname contin complex   latit  longit           coord locus allele_v2
1        lit Korean_1999    NEA       3  37.00N 127.30E  37.00N 127.30E  DRB1      1302
2        lit  Gabes_2006    NAF       2     34N     10E         34N 10E  DRB1      1311
300     ihwg      Ami_97    SEA       1 25.067N 121.65E 25.067N 121.65E     B      5601
600     ihwg   Brazilian    OTH    3mig     10S     55W         10S 55W     B      5001
1000    ihwg     Chaouya    NAF       2   35.5N      8W        35.5N 8W     A      2902
     allele_v3 allele.freq allele.count gametes
1        13:02     0.11001          111    1009
2        13:11     0.00526            1     190
300      56:01     0.20408           40     196
600      50:01     0.03371            6     178
1000     29:02     0.06716            9     134
Partial depiction of Solberg Dataset after dataSubset - note population names are now in lexicographic order, the only locus of note in the dataset is the locus provided in the motif (i.e. DRB1), and there is a new column, locus_allele, for comparison to the dataframe produced by AA_segments_maker.

>dataSubset(dataset, "DRB1*26F~28E~30Y")[c(1,2,300,600, 1000),]

      dataset        popname contin complex  latit longit         coord locus allele_v2
8615      lit      Ache_2003    SAM       1    24S    56W       24S 56W  DRB1      0403
8616      lit      Ache_2003    SAM       1    24S    56W       24S 56W  DRB1      0411
9365      lit      Bari_1994    SAM       2    10N    70W       10N 70W  DRB1      0403
10068     lit Cameroon_2001b    SSA       3  6.00N 12.00E  6.00N 12.00E  DRB1      1102
10679     lit     Czech_1992    EUR       3 49.45N 15.30E 49.45N 15.30E  DRB1      0301
      allele_v3 allele.freq allele.count gametes locus_allele
8615      04:03     0.00578            1     173   DRB1*04:03
8616      04:11     0.74566          129     173   DRB1*04:11
9365      04:03     0.03636            4     110   DRB1*04:03
10068     11:02     0.03175            8     252   DRB1*11:02
10679     03:01     0.11111           21     189   DRB1*03:01

coordinate_converter

coordinate_converter converts coordinates in the Solberg dataset with a cardinal direction to its appropriate numerical enumeration for plotting on a map. North (N), which is above the equator, and East (E), which is east of the prime meridian are positive, while South (S), below the equator, and West (W), west of the prime meridian, are negative.

Partial depiction of Solberg Dataset before coordinate_converter

                   popname contin latit longit
8615             Ache_2003    SAM   24S    56W
101            Algerian_99    NAF   35N   0.5W
17938 BrazilGuaraniNandeva    SAM   23S    54W
Partial depiction of Solberg Dataset after coordinate_converter

>coordinate_converter(heatmapdata)

                   popname contin latit longit
8615             Ache_2003    SAM   -24    -56
101            Algerian_99    NAF   35     -0.5
17938 BrazilGuaraniNandeva    SAM   -23    -54

countSpaces

countSpaces splits a given string, and white spaces are counted in between split elements. This function was used to count whitespaces in the alignment data downloaded from the ANHIG Github repository to determine the starting amino acid position. See below:

Prot              -30                              1
                   |                                |
 A*01:01:01:01           MAVM APRTLLLLLS GALALTQTWA GSHSMRYFFT SVSRPGRGEP RFIAVGYVDD 
 A*01:01:01:02N          ---- ---------- ---------- ---------- ---------- ---------- 
 A*01:01:01:03           ---- ---------- ---------- ---------- ---------- ---------- 
 A*01:01:01:04           ---- ---------- ---------- ---------- ---------- ---------- 
 A*01:01:01:05           ---- ---------- ---------- ---------- ---------- ---------- 
 A*01:01:01:06           ---- ---------- ---------- ---------- ---------- ---------- 

Alignment files from the ANHIG repository are formatted such that a certain number of characters must be present in a given line to maintain the alignment structure, so the start position is not immediately provided in the file. To avoid hard coding a start position, countSpaces was used to calculate the number of spaces between the pipe, which is right underneath the starting count (-30). In order to accomplish this, the following items were needed: 1)the number of spaces between the word “Prot” and starting count, “-30”, 2)the number of spaces from the end of the first allele name to the beginning of the amino acid start position 3)the number of characters in the first allele where the number of spaces between -30 and the beginning of the first amino acid were calculated by adding item 3 with item 2, and subtracting the sum of that from item 1.

#number of whitespaces counted for row 3 in the alignment file 
>countSpaces(alignment[[1]][3])
[1]  1 11  1  1  1  1  1  1  1  1  1

PALM

PALM produces a heatmap from the manipulated Solberg dataset by using the Generic Mapping Tools (GMT) R Package, which is an interface between R and the GMT Map-Making software. GMT software is required for this function and can be downloaded at https://www.soest.hawaii.edu/gmt/. GMT commands are called via a Bash script through using the gmt.system() function in the GMT R package. The final output of PALM is a heatmap saved as a JPEG in the user’s working directory.

Two default parameters are set for PALM: 1) color = TRUE Should the heatmap produced be in color? The default for color is set to true for a color heatmap – for a map in grayscale, set color==FALSE. 2) filter_migrant = TRUE Should OTH and migrant populations be filtered out of the dataset? The default for filter_migrant is set to true – set this parameter equal to FALSE to keep migrant and OTH populations in the dataset.

#JPEG with color
>PALM(Solberg_datset, "DRB1*26F~28E~30Y" )
...

#JPEG without color
>PALM(Solberg_datset, "DRB1*26F~28E~30Y", color=FALSE)
...